Abstract

The objective of this project was to analyse grocery data and conduct market basket analysis. Market basket analysis is a type of association analysis that aims to find associations/correlations between items (here, groceries) purchased by customers. Such analysis is conducted so that products can be recommended when certain other products have already been purchased or picked for purchase by a customer. In my analysis I used Python's mlxtend library, which provides an implementation of the apriori algorithm for association analysis, to find associations between groceries purchased by members of a supermarket. The data was sourced from Kaggle at https://www.kaggle.com/datasets/heeraldedhia/groceries-dataset. I found no concrete, significant relationships between the groceries purchased by customers. Certain products (X) had a considerable chance of influencing the purchase of another product (Y), but the products in these relationships were not purchased together often enough to justify acting on their potential relationship.

This lack of observed associations could be due to the fact that 10,000 out of 15,000 transactions included only 2 items, which may have been purchased without any relation to each other. With more data, and with more transactions containing more products, more significant associations might have been found. Regardless, this notebook documents my attempts to explore associations and the procedure used to arrive at my conclusions.

IDE and ensuring data quality

Renaming columns:

Ensuring that the data type of each attribute is suitable:

Member_number is a categorical variable in this data rather than a quantitative one: it is not unique to a row and is used to group rows together to find the purchases of a single member on a particular day. Thus, it is better stored as a string (object) than as an integer:

The date should also be of type datetime, and we can add year and month columns:
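A minimal sketch of these two type conversions on a toy frame. The column names (Member_number, Date, itemDescription) and the day-month-year date format are assumptions about the Kaggle file, and the rows are made up for illustration:

```python
import pandas as pd

# Toy stand-in for the groceries data; column names and the
# day-month-year date format are assumptions, not confirmed by the text.
df = pd.DataFrame({
    "Member_number": [1808, 1808, 2552],
    "Date": ["21-07-2015", "21-07-2015", "05-01-2015"],
    "itemDescription": ["tropical fruit", "rolls/buns", "whole milk"],
})

# Treat the member number as a label, not a quantity.
df["Member_number"] = df["Member_number"].astype(str)

# Parse the date strings, then derive year and month columns.
df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%Y")
df["year"] = df["Date"].dt.year
df["month"] = df["Date"].dt.month
```

Passing an explicit format avoids day/month ambiguity for dates such as 05-01-2015.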

Checking for any null/NA values (none are found, so the check returns False):
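One way such a check might look in pandas, shown on a made-up two-row frame (the column names here are hypothetical):

```python
import pandas as pd

# Made-up rows; the column names are placeholders for illustration.
df = pd.DataFrame({"member": ["1808", "2552"], "item": ["soda", "whole milk"]})

# True if any cell anywhere in the frame is null/NA.
has_nulls = bool(df.isnull().values.any())

# Per-column count of missing values, useful when the check is True.
nulls_per_column = df.isnull().sum()
```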

Top 10 most purchased items

The bar chart below shows that whole milk is the most purchased item, followed by 'other vegetables', rolls/buns, soda and tropical fruit.
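A sketch of how such a ranking can be computed, assuming the purchased items sit in a single pandas column. The values are toy data, not the real counts:

```python
import pandas as pd

# Toy item column; the real data has one row per purchased item.
items = pd.Series(
    ["whole milk", "whole milk", "whole milk",
     "other vegetables", "other vegetables", "soda"],
    name="item",
)

# value_counts() sorts by frequency, most common first.
top_items = items.value_counts().head(10)

# With matplotlib installed, top_items.plot.bar() would draw the chart.
```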

Top 10 most regular customers

Finding number of items sold monthly

Finding associations

For the mlxtend library to find associations, each row of the data provided should be a transaction in a one-hot encoded format. This means that each row is a transaction and, within that row, a 1 denotes that a certain item was purchased and a 0 that it was not (scroll down to see this format). Below follows the code to achieve such a format.

The data is stored in a dictionary, where each key is a member id combined with a date on which they made a transaction. Each key maps to a list of the items purchased by that member on that date. Below I check my dictionary with member "1808" and cross-check against the source data:
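A minimal sketch of this grouping step in plain Python. The rows are invented, and using a (member, date) tuple as the key is one possible choice rather than necessarily the one used in the notebook:

```python
from collections import defaultdict

# Invented (member, date, item) rows standing in for the source data.
rows = [
    ("1808", "2015-07-21", "tropical fruit"),
    ("1808", "2015-07-21", "rolls/buns"),
    ("2552", "2015-01-05", "whole milk"),
    ("1808", "2015-09-02", "soda"),
]

# One key per member and day; the value collects that day's items.
transactions = defaultdict(list)
for member, date, item in rows:
    transactions[(member, date)].append(item)
```

Looking up transactions[("1808", "2015-07-21")] then returns that member's basket for the day, which can be compared against the raw rows.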

Next I will create a new dictionary which will be converted into a dataframe. Its keys will be a transaction number and the products (these will become the columns of the data frame), and each value will be a list of 1's and 0's with one entry per transaction (the entries at a given position across all lists form one row of the data frame). The one-hot encoded format can be observed:
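The construction described here could be sketched as follows. The transactions are toy data and the variable names are hypothetical:

```python
import pandas as pd

# Toy transactions keyed by (member, date), as in the previous step.
transactions = {
    ("1808", "2015-07-21"): ["tropical fruit", "rolls/buns"],
    ("2552", "2015-01-05"): ["whole milk"],
    ("1808", "2015-09-02"): ["whole milk", "rolls/buns"],
}

# Every distinct product becomes a column.
products = sorted({item for items in transactions.values() for item in items})

# One list per column: 1 if the product is in that transaction, else 0.
encoded = {"transaction": list(range(len(transactions)))}
for product in products:
    encoded[product] = [int(product in items) for items in transactions.values()]

onehot = pd.DataFrame(encoded).set_index("transaction")
```

Recent mlxtend versions prefer boolean columns, so onehot.astype(bool) may be needed before passing the frame on.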

Finding the most frequent groceries across transactions using the apriori algorithm. The most frequent is 'whole milk', which appears in 15% of transactions, and the next most frequent are other vegetables, rolls/buns, soda and yogurt, which appear in 12% to 8% of transactions. This is not promising for generating associations, as few items occur often. An itemset can only be frequent if each item in it is frequent, so items that occur more often have more opportunity to form associations with other frequent items. Here no two items occur regularly: whole milk appears in only 15% of transactions and the second most frequent item, 'other vegetables', in only 12%.

Finding associations using mlxtend's association_rules() function:

Analysing our results

Now that we have generated our associations, we can see that 734 rules have been generated. Of those, only some are worth investigating: rules with the highest supports, high confidences and a lift greater than 1. We will only look at rules with lift greater than 1, as this implies that the antecedent (X) and consequent (Y) are not independent: Y appears alongside X more often than it would if the two were unrelated.
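For reference, the three measures named here are commonly defined as follows, where N is the total number of transactions and n(S) is the number of transactions containing every item in S (the symbols N and n are notation introduced for this note):

```latex
\begin{align*}
\mathrm{support}(X \Rightarrow Y) &= \frac{n(X \cup Y)}{N} \\
\mathrm{confidence}(X \Rightarrow Y) &= \frac{\mathrm{support}(X \Rightarrow Y)}{\mathrm{support}(X)} \\
\mathrm{lift}(X \Rightarrow Y) &= \frac{\mathrm{confidence}(X \Rightarrow Y)}{\mathrm{support}(Y)}
\end{align*}
```

A lift of exactly 1 is what independence between X and Y would produce, which is why 1 serves as the cut-off.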

Investigating potential high supports

The support of an association between antecedents (X) and consequents (Y) is, in essence, the probability of both X and Y occurring in a customer's transaction. If we look at the results generated below, under the condition that the lift is greater than 1, we can see that only 240 rules are worth looking at and that their supports range from 0.5% (top row) to 0.1% (last row). That is, in our data set each of these rules occurs in only 0.1% to 0.5% of transactions. Unfortunately, this means the associations/rules may not be very significant or applicable. The data shows no strong relationships between particular products, but this could well be due to the data itself, and more data may lead to more concrete and substantial associations.
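A sketch of this filtering and ranking on a hand-built rules table. The numbers and item names are invented for illustration, and the column names are assumed to mirror those of the frame returned by association_rules():

```python
import pandas as pd

# Invented rules; column names are assumed to match mlxtend's output.
rules = pd.DataFrame({
    "antecedents": [frozenset({"a"}), frozenset({"c"}), frozenset({"e"})],
    "consequents": [frozenset({"b"}), frozenset({"d"}), frozenset({"f"})],
    "support": [0.004, 0.002, 0.005],
    "confidence": [0.10, 0.05, 0.03],
    "lift": [1.2, 1.05, 0.9],
})

# Keep only rules whose lift suggests a positive association.
positive = rules[rules["lift"] > 1]

# Rank the remaining rules by how often they occur.
by_support = positive.sort_values("support", ascending=False)
```

Sorting the same filtered frame by "confidence" instead gives the ranking used when investigating confidence.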

Investigating confidence

Confidence is the probability of the consequent (Y) being in a transaction given that the antecedents (X) have already been purchased/picked by the customer. Confidence is useful because it shows how strongly the presence of one set of products points to another.
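To make the definition concrete, here is a small plain-Python sketch that computes support, confidence and lift by hand for one rule. The five transactions are invented and the helper name support is hypothetical:

```python
# Five invented transactions, each a set of items.
transactions = [
    {"yogurt", "sausage", "whole milk"},
    {"yogurt", "sausage"},
    {"whole milk"},
    {"soda"},
    {"soda", "rolls/buns"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent = {"yogurt", "sausage"}
consequent = {"whole milk"}

rule_support = support(antecedent | consequent)   # both sides together
confidence = rule_support / support(antecedent)   # P(Y given X)
lift = confidence / support(consequent)           # relative to P(Y) alone
```

In this toy data the antecedent appears in 2 of 5 baskets and the full rule in 1 of 5, so the confidence is one half.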

As all the associations have low support, they may not be significant. However, it is still worth investigating the confidence of these rules to see if certain products are 'related' to each other even though they have a low chance of actually being purchased together.

The results below with lifts greater than 1 show that the confidences of the 240 rules range from 25% to 0.6%. The top result shows that customers who purchased yogurt and sausage together (a 0.5% chance of this occurring) have a reasonable 25% chance of also purchasing whole milk (which has a 15% chance of being purchased in general, also relatively low). Since the chance of a customer purchasing yogurt and sausage is low (antecedent support of 0.5%), the association's support is also greatly reduced, even though whole milk has an acceptable individual support (consequent support) of 15%. Since this rule occurs in only about 0.1% of transactions, the association isn't really useful, but it is interesting to note for future expansions of the data, which could increase its support.
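As a rough check using the rounded figures quoted above (so the results are approximate, not the exact values in the rules table), the rule's support and lift follow from the antecedent support, the confidence and the consequent support:

```latex
\begin{align*}
\mathrm{support}(X \Rightarrow Y) &= \mathrm{support}(X) \times \mathrm{confidence}(X \Rightarrow Y)
  \approx 0.005 \times 0.25 = 0.00125 \approx 0.1\% \\
\mathrm{lift}(X \Rightarrow Y) &= \frac{\mathrm{confidence}(X \Rightarrow Y)}{\mathrm{support}(Y)}
  \approx \frac{0.25}{0.15} \approx 1.67
\end{align*}
```

Here X is {yogurt, sausage} and Y is {whole milk}. The lift above 1 is why the rule survives the filter, while the support of roughly 0.1% is why it is hard to act on.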

Top 10 associations with high confidence

Exploring a potential reason for the lack of significant associations

We can see in the bar plot below that 10,000 out of 15,000 transactions included only 2 items. Data with more items per transaction could have helped find more relationships, since there would potentially be more common pairs or sets of items.
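A sketch of how the basket-size distribution behind such a plot might be computed, assuming transactions are held in a dictionary of item lists. The data is a toy example:

```python
from collections import Counter

# Toy transactions keyed by (member, date).
transactions = {
    ("1808", "2015-07-21"): ["tropical fruit", "rolls/buns"],
    ("2552", "2015-01-05"): ["whole milk", "soda"],
    ("1808", "2015-09-02"): ["whole milk", "rolls/buns", "yogurt"],
}

# Number of transactions for each basket size.
size_counts = Counter(len(items) for items in transactions.values())

# Share of transactions that contain exactly two items.
share_of_two = size_counts[2] / len(transactions)
```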

References

https://www.kaggle.com/datasets/heeraldedhia/groceries-dataset

https://pbpython.com/market-basket-analysis.html

https://en.wikipedia.org/wiki/Association_rule_learning